feat: add RoboSpatial task #1347

Open
njb-nvidia wants to merge 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-robo_spatial-task

Summary

Adds RoboSpatial, a spatial-reasoning benchmark for robotic manipulation scenes (RoboSpatial-Home) covering three sub-categories:

  • compatibility — 105 items
  • configuration — 123 items
  • context — 122 items

Total: 350 items.

This port exposes:

  • `robo_spatial` (group)
  • `robo_spatial_all` (union of all three splits via `dataset_kwargs.data_files` with `verification_mode: no_checks`)
  • `robo_spatial_compatibility` / `robo_spatial_configuration` / `robo_spatial_context` (single-category sub-tasks via `_default_template.yaml`)

Metric: `robo_spatial_score` — task-specific scoring (point / region / affordance correctness; see `pre_process.py` for parsing).

Files

  • `lmms_eval/tasks/robo_spatial/_default_template.yaml` — shared task config.
  • `lmms_eval/tasks/robo_spatial/robo_spatial.yaml` — group definition.
  • `lmms_eval/tasks/robo_spatial/robo_spatial_all.yaml` — concatenated test split.
  • `lmms_eval/tasks/robo_spatial/robo_spatial_{compatibility,configuration,context}.yaml` — per-category tasks.
  • `lmms_eval/tasks/robo_spatial/utils.py` — doc transforms, scoring, aggregation.
  • `lmms_eval/tasks/robo_spatial/pre_process.py` — answer parsing helpers.

Parity vs. local fork

Qwen3-VL-2B-Instruct, full test split on 8x H100, greedy decoding.

| Source | Compatibility | Configuration | Context | Overall (350) |
| --- | --- | --- | --- | --- |
| Fork | 0.610 | 0.675 | 0.320 | 0.5314 |
| Upstream | 0.629 | 0.732 | 0.320 | 0.5571 |
Per-doc analysis on the 309 questions shared by both runs (matched by `doc_id`): 91.9% received identical scores.
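
The per-doc comparison can be sketched as follows. This is a minimal illustration, not the script used for this PR: it assumes each run's results are already loaded into a dict mapping `doc_id` to a numeric score, and the names `score_agreement`, `fork_scores`, and `upstream_scores` are hypothetical.

```python
def score_agreement(fork_scores, upstream_scores):
    """Return (number of shared doc_ids, fraction with identical scores)."""
    # Only doc_ids present in both runs can be compared.
    shared = sorted(set(fork_scores) & set(upstream_scores))
    if not shared:
        return 0, 0.0
    identical = sum(1 for d in shared if fork_scores[d] == upstream_scores[d])
    return len(shared), identical / len(shared)


# Toy data, made up for illustration (not taken from the actual runs).
fork_scores = {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0}
upstream_scores = {1: 0.0, 2: 0.0, 3: 1.0, 4: 1.0}

n_shared, agreement = score_agreement(fork_scores, upstream_scores)
print(n_shared, round(agreement, 3))  # 3 0.667
```

For scale, 91.9% of 309 shared docs would correspond to 284 docs with identical scores (284 / 309 ≈ 0.919).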

Delta (+2.6pp overall) is consistent with the qwen3_vl model-class drift we have observed on other ports (e.g. metavqa, egoplan2).
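
As a sanity check, the Overall column and the delta can be reproduced from the per-category scores and split sizes. The sketch below assumes each item is scored 0 or 1 and that Overall is the item-weighted (micro) average; neither is stated in this description, and the helper name `overall_from_categories` is hypothetical.

```python
SPLIT_SIZES = {"compatibility": 105, "configuration": 123, "context": 122}

REPORTED = {
    "Fork": {"compatibility": 0.610, "configuration": 0.675, "context": 0.320},
    "Upstream": {"compatibility": 0.629, "configuration": 0.732, "context": 0.320},
}


def overall_from_categories(scores, sizes):
    """Recover per-category correct counts and the item-weighted overall score.

    Assumes binary (0/1) per-item scores, so score * size is a whole number
    up to the rounding of the reported three-decimal score.
    """
    correct = {name: round(scores[name] * sizes[name]) for name in sizes}
    total = sum(sizes.values())
    return correct, sum(correct.values()) / total


results = {src: overall_from_categories(s, SPLIT_SIZES) for src, s in REPORTED.items()}
for src, (correct, overall) in results.items():
    print(src, correct, round(overall, 4))

# Difference between the two overall scores, in percentage points.
delta_pp = 100 * (results["Upstream"][1] - results["Fork"][1])
print(f"delta: {delta_pp:+.1f} pp")
```

Under these assumptions the counts come out to 186/350 for the fork and 195/350 for upstream, which matches the reported 0.5314 and 0.5571 and the +2.6pp delta.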

Test plan

  • Smoke test: `uv run lmms-eval --tasks robo_spatial_all --limit 8`
  • Full run on 8x H100 with Qwen3-VL-2B-Instruct; per-category scores match the fork within noise
  • `combined` split assembly via `dataset_kwargs.data_files` + `verification_mode: no_checks` verified end-to-end (350 docs loaded as expected)

RoboSpatial is a spatial-reasoning benchmark for robotic manipulation
scenes (RoboSpatial-Home) covering three sub-categories:
compatibility, configuration, and context.

Dataset: chanhee-luke/RoboSpatial-Home on HuggingFace.
Per-category splits: compatibility (105), configuration (123), context (122)
(350 items total).

This port exposes:
  - robo_spatial (group)
  - robo_spatial_all (union of all three splits via dataset_kwargs.data_files)
  - robo_spatial_compatibility / robo_spatial_configuration / robo_spatial_context

Metric: robo_spatial_score — task-specific scoring implemented in utils.py
(point/region/affordance correctness; see pre_process.py for parsing).